Abstract

Parsing human poses in images is fundamental to extracting
critical visual information for artificially intelligent
agents. Our goal is to learn self-contained body part
representations from images, which we call visual symbols,
together with their symbol-wise geometric contexts. Each
symbol is learned individually by categorizing visual
features with the help of geometric information. For the
categorization, we use a Latent Support Vector Machine
followed by an efficient cross-validation procedure. These
symbols then naturally define the geometric contexts of body
parts at a fine granularity. When the structure of the
compositional parts is a tree, we derive an efficient
approach to estimating human poses in images. Experiments on
two large datasets suggest that our approach outperforms
state-of-the-art methods.



1 Introduction 

Parsing human poses in images has been a classical topic in
artificial intelligence for decades [Marr, 1982]. This research
facilitates a number of fundamental studies ranging from
visual perception [D'Ausilio et al., 2012] to computer vision
[Felzenszwalb and Huttenlocher, 2005] and, particularly in
recent years, cognitive robotics [Jenkins et al., 2007].

We focus on learning visual representations of body parts
in this parsing process, which we call visual symbols. [Marr,
1982] has already argued that any meaningful representation
of the human body should be self-contained in a semantic
hierarchy. In his work, the main ingredients of
"self-containedness" are that (i) a self-contained unit must
have a limited complexity, such that (ii) information appears
in geometric contexts appropriate for recognition, and (iii)
the representation can be processed flexibly.

However, recent research runs contrary to this intuitive
philosophy. Body parts are frequently represented by plain
cardboard models (e.g., joints or limbs only). Since each part
is not distinctive, the visual units are treated merely as a
means of computing the probability of body part locations,
and the geometric contexts are coarsely defined as simple
distributions between parts. As such, the inference models may
have to go beyond tree structures to model long range
interactions in this coarse structure (e.g., [Sun et al., 2012]).
This makes the problem less tractable, and approximate
inference has to be adopted.

Can we still use exact and fast inference to remedy the
problems caused by the deformable nature of human beings?
We propose to use compositional parts and to exploit a
symbol-wise geometric context map for effective pose parsing
in still images. Our part representation is compositional,
i.e., each part may contain one or more physical joints of the
human body, and the relationship between two parts can be
either hierarchical (parent-child, e.g., leg and upper leg,
Fig. 1a) or flat. This allows us to categorize distinctive
visual features for body parts, which are the descriptors that
characterize the properties of image patches, and eventually
to model pairwise interactions at a fine granularity.

A popular view of the geometric context is that the pairwise
relationship between two parts can be characterized by a
distribution. Fig. 1b shows the distributions of relative
locations between upper/lower leg (denoted by green/blue
colors) and "leg" (denoted by cyan), respectively.

It is seemingly legitimate to assume that both point sets
follow normal distributions, but let us take a closer look at
the data. Assume we have learned the symbols of the "leg" as
depicted in Fig. 1. We can then group all the instances in
each point set into a few categories (Fig. 1c and 1d).
Further, we redraw the relative locations of the upper/lower
leg that are associated only with the corresponding symbols
of the leg (Fig. 1e). We make two observations: 1) relative
locations may have different distributions and exhibit
different characteristics, and 2) some categories may have
similar distributions, but they are compactly distributed and
have much smaller variances compared to Fig. 1b. Therefore,
it is much easier to model these subsets separately.

The concept of symbols, although it is not new in vision 
and artificial intelligence, has not been used explicitly in 
many state of the art pose estimation methods. Such a set 
of learned symbols enables encoding the symbol-wise geo- 
metric information in a finer scale, and provides more infor- 
mation in inference. Therefore, it is critical to learn symbols 
for compositional parts. 

Figure 1: Our motivation. (a) Human body. Green denotes upper leg, blue denotes lower leg. (b) Relative distances between
upper/lower leg and leg in a large dataset, respectively. If we can group the instances to symbols (c)(d), we can easily see that
relative distances can be modeled in a fine scale (e). The coordinates are defined in the image space (pixels).

With the help of geometric information, we categorize visual
features from part instances, and use cross validation to
select the best categories. We use the Histogram of Oriented
Gradients (HOG) [Dalal and Triggs, 2005] in this paper. HOG
is frequently used as the feature for the appearance model of
body parts. In this descriptor, an image is divided into smaller
regions called cells, and the histogram of gradient directions
of the pixels within each cell is calculated as the descriptor.
Our approach has the following contributions: 

• We explore an effective procedure for learning self- 
contained symbols of body parts in parsing poses. 

• Our symbol-wise context map naturally encodes the le- 
gitimate combinations of human poses from images. 

• Our representation is very flexible; it is therefore
compatible with the majority of the popular inference
algorithms in human pose estimation.

Following the human kinematic structure, we derive an approach
to effectively estimate human poses. We demonstrate the
performance of our method on two large datasets, where it
outperforms state-of-the-art methods.

2 Related work 

Marr was among the first to propose a hierarchical organization
of the body for parsing human poses [Marr, 1982]. Each model
in this hierarchy is a self-contained unit, and the geometric
contexts among these units are designed for recognition. His
structure motivated a number of approaches in computer vision
and machine learning. In recent years this idea has evolved
into deformable models, where pose estimation is formulated
as a part-based inference problem.

The Pictorial Structure Model (PSM) [Felzenszwalb and
Huttenlocher, 2005] is one of the most successful deformable
models. A tree-structured graphical model is used, and
pairwise terms are based on the relative distances between
corresponding body parts. [Yang and Ramanan, 2011] proposed a
mixtures-of-parts model for articulated pose estimation.
Instead of modeling both the location and orientation of body
limbs as rigid parts (e.g., [Andriluka et al., 2009]), they
used non-oriented pictorial structures with co-occurrence
constraints. Their work relies on the geometry to define
clusters (called "types" in their paper). Therefore, the
representation is less self-contained from Marr's point of
view.

Following this research direction, [Sun and Savarese, 2011]
used predefined symbols for a simultaneous detection of body
parts and estimation of human poses. [Tian et al., 2012] used
compatibility maps in a tree structure. Latent nodes encode
compatibility between parts, and accuracy is improved because
incompatible poses are pruned. However, their "types"
variables are solely based on the geometry, and do not encode
visual information.

Graphical models beyond tree structures were also proposed
for pose estimation. [Tran and Forsyth, 2010] evaluated the
performance of human parsing in a full relational model. A
recent work by [Sun et al., 2012] solved a loopy star model
using a Branch-and-Bound strategy. These structures usually
lead to better performance, but all require efficient
approximate inference methods.

All of the above methods used various definitions of
cardboard-style parts, such as limbs [Andriluka et al., 2009]
and joints [Yang and Ramanan, 2011]. These may appear easier
to model, but the features are less distinctive and the
performance is limited by the coarse geometric context. Beyond
these plain structures, researchers have turned to
compositional parts in recent years to characterize higher
level visual representations.

[Wang et al., 2011] proposed to use compositional parts to
provide more precise results, but their method has a higher
computation cost due to the loopy graph structure. [Rothrock
and Zhu, 2011] cast human pose into AND/OR graphs, and
performed human parsing using a top-down scheme. Rich
appearance models were adopted to estimate human parts.
[Bourdev et al., 2010] proposed to use poselets for human
recognition. Each poselet refers to a combined part that is
distinctive in training images. Please note that these poselets
do not characterize geometric contexts when modeling pairwise
distributions, which makes them less effective at fully
capturing the dynamic structure of the body.




Figure 2: Training visual symbols. Given a set of instances of a compositional part (a), our approach categorizes these instances
(b) and summarizes them into symbols by a set of linear filters (c). (d) Our tree structure model for parsing human poses. Semantically,
there are two "high level" parts (red), nine "mid level" parts (orange), and 14 joints (blue) in total.



3 Approach 

Define the set of $M$ parts as $P = \{P_i\}$, $i \in [1, \dots, M]$.
One of our goals is to learn a symbol set $S$ such that an
instance of a body part can be labelled by an entry in $S$.

In many state of the art datasets, the locations of primitive 
parts (e.g., joints) are manually annotated. Therefore, the goal 
of our approach is to learn and assign symbols for composi- 
tional parts in the training set, and to detect compositional 
parts in test images by inferring their symbols. 

We first present our approach to learning visual symbols 
for compositional parts, then we derive the compatibility map 
for fine scale modeling of geometric context. Finally, we 
adopt an efficient learning and inference method when the 
structure of graphical models is a tree. 

3.1 Learning symbols for compositional parts 

Let $G = (P, E)$ denote the relationship graph, where $P$
denotes the body parts, and $E$ is the set of edges that denote
the pairwise relationships between parts. An instance of a
compositional part is $p_i = (loc_i, s_i)$, where $loc_i$ can be
used for computing the local geometric context (relative
distance) in $G$, and $s_i$ denotes visual appearance.

We exploit the advantages of both geometric and visual
information. We first use geometric information to coarsely
group image patches for $p_i$ into different clusters, then we
categorize the visual features in each cluster into visual
symbols. This is more computationally efficient than first
categorizing visual features and then grouping locations into
symbols.

Geometric grouping 

Given a pair of connected parts $(p_i, p_j)$ in $G$, we use the
first part $p_i$ as the reference, and calculate the relative
locations of $p_j$ with respect to $p_i$. As a result, all
samples of $p_j$ are projected into a 2D geometric space.

For efficiency, we run k-means to group the samples into $k_j$
geometric clusters, each of which denotes a geometric type of
$p_j$.
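
The grouping step above can be illustrated with a minimal sketch. It assumes part locations are given as 2D pixel coordinates and that the initial cluster centers are supplied by the caller; the function names `relative_locations` and `kmeans` are hypothetical and are not taken from our implementation.

```python
import math

def relative_locations(ref_points, part_points):
    """Offsets of p_j measured from the reference part p_i (pixels)."""
    return [(px - rx, py - ry)
            for (rx, ry), (px, py) in zip(ref_points, part_points)]

def kmeans(points, init_centers, n_iter=50):
    """Plain Lloyd's k-means on 2D points; returns (centers, labels)."""
    centers = [tuple(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every offset to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, labels
```

Each resulting label plays the role of a geometric type of $p_j$; the choice of initial centers is left open here and is an assumption of the sketch.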

Discovering visual categories 

Instances of compositional parts still have large variations in
appearance within a geometric type. Therefore, the geometry
alone is not powerful enough to characterize symbols, and we
need to further learn visual categories in each geometric
group to generate more discriminative visual symbols.

Fig. 2 illustrates the learning process. Given a set of
instances of a part (Fig. 2a), our approach categorizes these
instances into a number of subsets that are meaningful both in
geometric context and visual appearance (Fig. 2b), and
summarizes them into symbols by a set of linear weights
(Fig. 2c).

Let $\phi(I, p_i)$ denote the visual feature of $p_i$ in the
image $I$. For instances of $p_i$ within the same geometric
context, we aim to learn linear classifiers that categorize
visual features.

We followed [Divvala et al., 2012] and built a Latent Support
Vector Machine model for learning visual subcategories.
Given $N_1$ positive instances of a compositional part and
$N_2$ negative instances, we learn $K$ subcategories of this
positive set. This allows us to generate the labels
$L = \{l_1, l_2, \dots, l_N\}$, $l_i \in [1, K]$, for each
instance. Our objective function is as follows

$$l_i = \arg\max_{k} \; w_k \cdot \phi(I, p_i), \tag{1}$$

where $y_i \in \{1, -1\}$ denotes whether instance $i$ is from
the positive or the negative sample set, and $w_k$ are the
weights of the feature map for each part. As in other
clustering methods, we use k-means to initialize the
categorization.
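
As an illustration of the latent assignment in Eq. 1, the following sketch assigns each instance to the subcategory whose linear filter responds most strongly. It covers only the label assignment step, not the SVM training itself; the names `dot` and `assign_subcategories` are hypothetical, features are assumed to be flat lists of numbers, and labels are 0-based for convenience.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def assign_subcategories(filters, features):
    """For each feature phi_i return (l_i, score) with l_i = argmax_k w_k . phi_i.

    filters:  list of K weight vectors w_k
    features: list of N feature vectors phi(I, p_i)
    """
    labels, scores = [], []
    for phi in features:
        responses = [dot(w, phi) for w in filters]
        best = max(range(len(filters)), key=lambda k: responses[k])
        labels.append(best)            # 0-based subcategory index
        scores.append(responses[best])
    return labels, scores
```

In a full latent SVM, this assignment would alternate with retraining the filters $w_k$ on the newly labelled instances.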

Cross validation 

To achieve effective training when the number of visual
samples is small, we turn to a cross-validation learning
paradigm to discover the best classifiers and fine tune the
performance. The main idea of cross validation is to perform a
training step followed by a validation step [Singh et al.,
2012]. During the validation stage, each classifier is
evaluated on the validation set, and weak classifiers with few
detected samples are removed from the classifier set.

The whole training process is conducted iteratively, as
illustrated in Algorithm 1. After cross training, the
"surviving" classifiers can be regarded as a mixture of visual
symbols.



Algorithm 1 Cross validation.

Input:
  $H$: training set for a compositional part $P_i$;
  $K_{in}$: the number of classifiers.
Output:
  $K_{out}$: the number of visual categories;
  $w_k$: linear classifiers, $k = [1, \dots, K_{out}]$.
Procedure:
  1. Divide the training set equally into $H_1$ and $H_2$;
  2. Train the classifiers $w_{[1, \dots, K_{in}]}$ on $H_1$ using Eq. 1;
  3. Evaluate the classification result on $H_2$;
  4. If the number of detected samples for $w_i$ is small:
       remove $w_i$ from the classifiers and set $K_{in} = K_{in} - 1$;
  5. Swap $H_1$ and $H_2$;
  6. Repeat steps 1-5 $t$ times ($t = 10$ in our experiments);
  7. Set $K_{out} = K_{in}$ and output $w_{[1, \dots, K_{out}]}$.
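
A minimal sketch of this pruning loop is given below. Several simplifications are assumptions of the sketch rather than details of our implementation: the trainer is passed in as a hypothetical callable `train_fn`, a sample counts as "detected" by the classifier with the highest response, the threshold `min_detections` is arbitrary, and the data are split once and then swapped between rounds.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def count_detections(filters, samples):
    """Number of validation samples on which each filter has the top response."""
    counts = [0] * len(filters)
    for phi in samples:
        responses = [dot(w, phi) for w in filters]
        counts[responses.index(max(responses))] += 1
    return counts

def cross_validate(samples, filters, train_fn, min_detections=2, n_rounds=10):
    """Alternate training and validation halves, dropping weak classifiers."""
    half = len(samples) // 2
    h1, h2 = samples[:half], samples[half:]            # step 1
    for _ in range(n_rounds):                          # step 6
        filters = train_fn(h1, filters)                # step 2 (hypothetical trainer)
        counts = count_detections(filters, h2)         # step 3
        survivors = [w for w, c in zip(filters, counts)
                     if c >= min_detections]           # step 4
        if survivors:                                  # never drop every classifier
            filters = survivors
        h1, h2 = h2, h1                                # step 5
    return filters                                     # step 7
```

With an identity trainer, a filter that never wins on the validation half is removed after the first round, which mirrors the removal of weak classifiers in step 4.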



Discussion 

The visual categorization process of a compositional part 
characterizes the appearance models in a way that they can 
be regarded as "templates". When HOG feature is used, the 
set of learned weights is also considered as "HOG filters". 

In this way, our symbols encode both geometric and visual 
appearance information. This makes our descriptors different 
from other work, because they are more discriminative and 
representative. 

3.2 Defining symbol-wise geometric context 

Assigning each part a symbol allows us to build a
compatibility map for any pair of symbols. Assume we have two
compositional parts $p_i$ and $p_j$ with symbols $s_i$ and
$s_j$, respectively. We create the pairwise compatibility term
between the parts as follows:

$$D(I, p_i, p_j) = \omega_{ij}^{s_i s_j} \cdot \psi_{ij}(p_i, p_j) + b_{ij}^{s_i s_j}, \tag{2}$$

where $\psi_{ij}(p_i, p_j) = [dx, dy, dx^2, dy^2]$ denotes the
relative distance between $p_i$ and $p_j$,
$\omega_{ij}^{s_i s_j}$ denotes the symbol-specific weights,
and $b_{ij}^{s_i s_j}$ denotes the bias of the compatibility.
If two symbols are not "compatible", i.e., they never exist
together in any training image, this bias term is $-\infty$.
Both terms are learned in Sec. 3.3.
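
The compatibility term of Eq. 2 can be sketched as a lookup keyed by the symbol pair. The dictionary layout, the function names, and the sign convention of the offset ($p_j$ minus $p_i$) are assumptions made for this illustration only.

```python
import math

def deformation_features(p_i, p_j):
    """psi_ij = [dx, dy, dx^2, dy^2] for the offset from p_i to p_j."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    return [dx, dy, dx * dx, dy * dy]

def compatibility(p_i, p_j, s_i, s_j, weights, biases):
    """D = w^{s_i s_j} . psi_ij + b^{s_i s_j}.

    weights, biases: dicts keyed by (s_i, s_j); a pair that never
    co-occurred in training has no entry and scores -infinity.
    """
    key = (s_i, s_j)
    if key not in biases:
        return -math.inf
    psi = deformation_features(p_i, p_j)
    return sum(w * f for w, f in zip(weights[key], psi)) + biases[key]
```

Returning $-\infty$ for unseen symbol pairs means that such combinations can never be selected by a maximization over configurations.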

In graphical models, the energy minimization problem is
frequently solved by passing messages from one node to
another. This message passing step is practical in both exact
and inexact inference. We can utilize the compatibility term,
which results in fine-scale message passing. For node $P_j$,
the outgoing message $m_j(p_j, s_j)$ and the incoming message
$m_{k \to j}(p_j, s_j)$ from other nodes $P_k$ are computed as:

$$m_j(p_j, s_j) = \sum_{k \in n(j)} m_{k \to j}(p_j, s_j), \tag{3}$$

$$m_{k \to j}(p_j, s_j) = \max_{s_k} \max_{p_k} \left[ m_k(p_k, s_k) + D(I, p_k, p_j) \right], \tag{4}$$

where $n(j)$ denotes the neighbors of $p_j$ in $G$.
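
A small sketch of one message update may help. It assumes that each node has a finite list of candidate states, where a state stands for one (location, symbol) pair, and that the compatibility values have been precomputed into a table. Adding the node's own appearance score when combining messages is also an assumption of the sketch; the function names are hypothetical.

```python
import math

def message_to_neighbor(child_scores, compat):
    """Eq. 4 on discrete states.

    child_scores[a]: score m_k of child state a
    compat[a][b]:    D between child state a and neighbor state b
    Returns the message over neighbor states and the best child state for each.
    """
    n_states = len(compat[0])
    message, backptr = [], []
    for b in range(n_states):
        totals = [child_scores[a] + compat[a][b]
                  for a in range(len(child_scores))]
        best = max(range(len(totals)), key=lambda a: totals[a])
        message.append(totals[best])
        backptr.append(best)
    return message, backptr

def node_score(appearance, incoming):
    """Eq. 3 plus the node's own appearance score for each state."""
    return [appearance[b] + sum(m[b] for m in incoming)
            for b in range(len(appearance))]
```

The back-pointers allow the best configuration to be read off by tracing from the root to the leaves once all messages have been passed.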



3.3 Inference 

Given a set of visual symbols, a compatibility map, and their
graph structure $G$, one can learn the parameters and perform
inference. When the structure is a tree, the inference is exact.
In our experiments, we define the compositional parts shown in
Fig. 2d. Semantically, our structure has three levels: "upper
body" and "lower body" as a coarse model of the human body;
head, upper limbs, and lower limbs at the mid level; and
joints at the fine level.

Appearance term We use HOG templates to represent each
visual subcategory. For each part $p_i$ in an image $I$, the
appearance score of a local patch can be written as [Yang and
Ramanan, 2011]

$$B(I, p_i) = \omega_i^{s_i} \cdot \phi(I, p_i), \tag{5}$$

where $\omega_i^{s_i}$ is the weight for symbol $s_i$ in the
$i$-th part. This term can be initialized by the results $w_k$
in Eq. 1.

Deformable term The pairwise term between $p_i$ and $p_j$ is
defined as the symbol-wise context (Eq. 2). This term can be
computed efficiently by the distance transform during
inference.

Objective function Our objective function is as follows

$$\hat{p} = \arg\max_{p} \sum_{i} B(I, p_i) + \sum_{(i,j) \in E} D(I, p_i, p_j). \tag{6}$$

Since our model is a tree, the standard message passing
algorithm (Eq. 3 and Eq. 4) and exact inference are applicable.

Learning model parameters The objective function (Eq. 6)
can be rewritten as $f = \langle \theta, \Phi \rangle$, where
$\langle \cdot, \cdot \rangle$ is the inner product, $\theta$
consists of image filters for single parts ($\omega_i^{s_i}$),
pairwise deformable weights ($\omega_{ij}^{s_i s_j}$) and
biases $b_{ij}^{s_i s_j}$, and $\Phi(I_n, p)$ denotes the
concatenated features from the appearance and deformable
components. The learning of $\theta$ amounts to the quadratic
optimization:

$$\begin{aligned}
\arg\min_{\theta,\ \xi_n \ge 0} \quad & \frac{1}{2}\, \theta^{T} \cdot \theta + C \sum_{n=1}^{N} \xi_n \\
\forall n \in \text{pos} \quad & \theta^{T} \Phi(I_n, p) \ge 1 - \xi_n, \\
\forall n \in \text{neg} \quad & \theta^{T} \Phi(I_n, p) \le -1 + \xi_n.
\end{aligned} \tag{7}$$

This is a standard quadratic program, and can be solved
efficiently.
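
To make the roles of the slack variables concrete, the following sketch evaluates the objective of Eq. 7 for a given $\theta$. It uses the standard fact that, for fixed $\theta$, the smallest feasible slack is $\xi_n = \max(0, 1 - y_n \theta^{T} \Phi_n)$ with $y_n = 1$ for positives and $y_n = -1$ for negatives. It does not solve the quadratic program, and the name `svm_objective` is hypothetical.

```python
def svm_objective(theta, features, labels, C=1.0):
    """Value of 0.5 * theta.theta + C * sum(xi_n) with the tightest slacks.

    features: list of concatenated feature vectors Phi(I_n, p)
    labels:   +1 for positive examples, -1 for negative examples
    """
    slacks = []
    for phi, y in zip(features, labels):
        margin = y * sum(t * f for t, f in zip(theta, phi))
        # Both constraints of Eq. 7 reduce to y * theta.Phi >= 1 - xi.
        slacks.append(max(0.0, 1.0 - margin))
    regularizer = 0.5 * sum(t * t for t in theta)
    return regularizer + C * sum(slacks), slacks
```

A negative example scoring -0.5 violates its margin by 0.5, so it contributes a slack of 0.5, whereas a positive example scoring 2 contributes none.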

4 Experiments 

We present our experiments in this section. First, we describe 
the datasets we used for evaluation. Then, we demonstrate 
the visual symbols learned by our method, and compare our 
approach against four other methods. 

In our experiments, we extract HOG features from image patches
on a grid of 4 x 4 pixel cells, and learn visual symbols and
the geometric context map. The number of geometric clusters
$k_j$ is 8 for large compositional parts and 6 for small
parts, and the number of visual symbols for each geometric
type is set to 2 and 4, respectively. The number of geometric
clusters is consistent with that in [Yang and Ramanan, 2011].
The final number of appearance clusters depends on the
cross validation. As a result, we learn 8 to 20 visual
categories for each visual symbol after cross validation.

Figure 3: We show filters for our compositional parts: torso
(yellow), lower body (orange), head (red), upper arm (cyan),
lower arm (magenta), upper leg (green), lower leg (blue).
Left/right side are denoted by solid/dashed lines, respectively.
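
The cell-based feature extraction used in this setup can be sketched as follows. This is a simplified illustration of the per-cell orientation histograms at the core of HOG, assuming a grayscale image stored as a list of rows, 4 x 4 pixel cells, and nine unsigned orientation bins; block normalization and interpolation between bins, which the full descriptor of [Dalal and Triggs, 2005] includes, are omitted, and the function name is hypothetical.

```python
import math

def cell_orientation_histograms(img, cell=4, n_bins=9):
    """Magnitude-weighted histograms of gradient orientation per cell."""
    h, w = len(img), len(img[0])
    n_cy, n_cx = h // cell, w // cell
    hist = [[[0.0] * n_bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Finite differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned angle
            b = min(int(ang / (180.0 / n_bins)), n_bins - 1)
            hist[y // cell][x // cell][b] += mag
    return hist
```

For an image whose intensity increases only from left to right, all gradient energy falls into the first orientation bin of every cell.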

4.1 Dataset 

We evaluate our performance on two large datasets, namely the
Image Parse dataset [Ramanan, 2007] and the Leeds Sport
dataset [Johnson and Everingham, 2010]. In all experiments,
we used 500 images from the negative set of the INRIA person
dataset [Dalal and Triggs, 2005] as negative samples.



Image Parse dataset 

The Image Parse dataset (PARSE) contains 305 images with
annotated poses. The images cover various human activities,
backgrounds, and illumination conditions. All images are
resized such that the humans in them have roughly the same
scale (150 pixels in height). We used 100 images for training,
and the rest for testing.

LSP dataset 

The recent Leeds Sport Dataset (LSP) contains 2000 images. 
This collection has a larger variation of pose changes. Hu- 
mans in each image were cropped and scaled to 100 to 150 
pixels in height. The dataset was split evenly into training set 
and testing set. 

4.2 Demonstration 

We demonstrate the effectiveness of the learning procedure in
this section. Fig. 3 shows localization results for the eleven
higher level compositional parts, as well as examples of their
filters learned by our method. In this visualization, we use
different colors and line types to denote different parts.

Filters for visual symbols Each filter in Fig. 3 exhibits a
few characteristics of the corresponding compositional part.
These filters are related, in the sense that they model the
human body at different levels. Each filter is also
self-contained, e.g., owing to the training process, no filter
is a sub-region of another. This intrinsic constraint
facilitates the inference. For instance, the torso (yellow)
filters indicate the body inclination at a coarse level, which
limits the search for the head position (red) both in
geometric context and appearance.

Interpretation The localization results can be regarded as
multi-level symbolic annotations of the human body. They can
therefore be used for a number of applications ranging from
high level understanding to low level description. We can
further assign semantic meaning to these symbols. For
instance, the torso and lower body locations can be used for
analyzing the actions of human beings ("stretching",
"standing", "squat", etc.), and the mid-level limb detection
results can be used for motion analysis ("extending arm",
"fetching", etc.).

4.3 Comparison 

We compared our approach against four state-of-the-art methods
for human pose parsing on the PARSE and LSP datasets. In our
comparison, we used the criterion in [Ferrari et al., 2008]
for performance evaluation. A part is correctly detected if
both of its endpoints are within 50% of the length of the
corresponding ground truth segment. We then used the
probability of a correct pose (PCP) to measure the percentage
of correctly localized body parts.
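
The criterion above can be written down as a short sketch. It assumes that each part is a segment given by two endpoints in pixel coordinates and that predicted and ground truth endpoints are listed in the same order; the function names are hypothetical.

```python
import math

def part_is_correct(pred, gt, thresh=0.5):
    """True if both predicted endpoints lie within thresh * |gt segment|."""
    length = math.dist(gt[0], gt[1])
    return all(math.dist(p, g) <= thresh * length
               for p, g in zip(pred, gt))

def pcp(pred_parts, gt_parts, thresh=0.5):
    """Fraction of parts that satisfy the endpoint criterion."""
    hits = [part_is_correct(p, g, thresh)
            for p, g in zip(pred_parts, gt_parts)]
    return sum(hits) / len(hits)
```

For a ground truth segment of length 10 pixels, a prediction is accepted only if each endpoint is off by at most 5 pixels.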

Table 1 summarizes the evaluation results. We compared the
parsing accuracy of our method against [Andriluka et al.,
2009], [Yang and Ramanan, 2011], [Johnson and Everingham,
2010], and [Tian et al., 2012]. We re-ran the methods of
[Andriluka et al., 2009] and [Yang and Ramanan, 2011] on the
datasets and report the results; the other entries in the
table are taken from the original papers. Please note that
[Tian et al., 2012] tried two different settings in their
method, but they did not report the result using all the 1000
images in the training set.

Overall, our method achieved the best performance. On the
PARSE dataset, our method is marginally better than the
original algorithm of [Yang and Ramanan, 2011] (75.7% vs
74.9%). This is possibly because the power of our visual
category training cannot be fully exploited with a small
number of training samples.

The problem caused by limited training samples is relieved
on the LSP dataset. When 1000 images are used for training,
our method outperforms the other methods. Compared to the four
methods, our total detection accuracy (65.2%) is consistently
higher. Our performance is also superior to [Johnson and
Everingham, 2011], where 11000 samples were used for
training.¹ This suggests that our training is effective.

Fig. 4 shows some examples of parsing results for three
methods. Each triplet contains results for [Andriluka et al.,
2009], [Yang and Ramanan, 2011], and ours. The parsing results
show that our method produces visually pleasing results.
The interaction between high level and low level compositional
parts makes our results more "balanced". For instance,
detection results for two legs are well separated if the
corresponding symbols of the lower body part are detected,
because the symbol-wise geometric context naturally guides the
maximization (Eq. 6) to this optimal solution. Our method
also tries to make the best guess of self-occluded parts.

¹ Due to a large number of missing labels in this dataset, we do
not perform our evaluation on it. In their method, the training
samples were automatically relabeled during optimization.

Figure 4: Result comparison. Each triplet of images contains results for [Andriluka et al., 2009] in its original visualization,
[Yang and Ramanan, 2011] and ours using the visualization protocol in [Johnson and Everingham, 2011].

Exp.   Method                             Torso  Head  Upper Leg  Lower Leg  U.Arm  L.Arm  Total
PARSE  [Yang and Ramanan, 2011]           97.6   93.2  83.9       75.1       72.0   48.3   74.9
       Ours                               97.1   90.2  86.1       77.1       74.9   46.9   75.7
LSP    [Yang and Ramanan, 2011]           92.6   87.4  66.4       57.7       50.0   30.4   58.9
       [Tian et al., 2012] (first 200)    93.7   86.5  68.0       57.8       49.0   29.2   58.8
       [Tian et al., 2012] (5 models)     95.8   87.8  69.9       60.0       51.9   32.9   61.3
       [Johnson and Everingham, 2010]     78.1   62.9  65.8       58.8       47.4   32.9   55.1
       [Johnson and Everingham, 2011]     88.1   74.6  74.5       66.5       53.7   37.5   62.7
       [Andriluka et al., 2009]           76.8   68.5  56.9       48.8       37.4   20.0   47.1
       Ours                               92.2   84.7  78.1       67.5       54.7   37.2   65.2

Table 1: Performance on the PARSE and the LSP datasets. The first two rows show the performance comparison on the
PARSE dataset against [Yang and Ramanan, 2011]. The next seven rows show the performance of five algorithms on the more
challenging LSP dataset.

The training takes approximately 8 hours on a 2.8 GHz
quad-core CPU with 6 GB of memory. Testing takes 1.5 s for a
320 x 240 image. Compared to models where loopy BP is used,
our tree structure speeds up the training. The running time
of our method is of the same order of magnitude as that of
[Yang and Ramanan, 2011]. Therefore, our method strikes a
balance between the accuracy gained from long range
interactions and the efficiency of exact inference.

5 Conclusion 

This paper presents a novel approach to learning
self-contained representations for parsing human poses in
images. The main contribution is the visual symbols that
facilitate geometric context modeling. Our method can be used
with many graphical models. When the model is a tree, we
demonstrate that our method outperforms four current methods.